What are Sankey Plots?

  • Flow diagram
  • Width of flows = amount
  • Named after Matthew Henry Phineas Riall Sankey
  • Flow and distribution of heat in steam engines, 1898

https://upload.wikimedia.org/wikipedia/commons/1/10/JIE_Sankey_V5_Fig1.png

History

  • Most famous early Sankey-style plot: Charles Minard's map of Napoleon's Russian campaign of 1812
  • Sankey diagram on map
  • Created 1869
  • Combines many data types: army size, position, direction of travel, temperature, dates

https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png

How are they used?

  • Energy: input, output, waste
  • International Energy Agency: Flow of energy for the entire planet from 1973 to 2019
  • Interactive: Show change of energy flow for selected country and years

https://www.iea.org/sankey/#?c=World&s=Balance

How are they used?

  • Eurostat: Interactive energy balance flow for EU or countries of EU from 1990 to 2020

https://ec.europa.eu/eurostat/web/energy/energy-flow-diagrams

How are they used?

  • Voter flow in elections

https://www.tagesschau.de/inland/btw21/waehlerwanderung-bundestagswahl-103.html

How are they used?

  • Voter flow in elections

https://www.economist.com/graphic-detail/2019/11/01/a-british-election-and-other-uncertainties

How should/shouldn't they be used?

  • Show flow of different categories between two or more distributions
    • Start: Initial distribution of categories (usually the left side)
    • Flow: Redistribution from left to the right
    • End: New distribution (usually the right side)
  • Width of each line represents the volume or amount
  • Avoid too many categories
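
The start/flow/end idea above can be sketched in a few lines of plain Python. The voter numbers and party names below are made up for illustration, and distributions is a hypothetical helper, not part of any library. Each flow is a (source, target, value) triple; summing by source gives the start distribution and summing by target gives the end distribution.

```python
from collections import defaultdict

# Made-up voter flows between two elections: (source, target, voters).
links = [
    ("Party A", "Party A", 60),
    ("Party A", "Party B", 15),
    ("Party B", "Party B", 40),
    ("Party B", "Party A", 5),
    ("Non-voters", "Party B", 10),
]


def distributions(links):
    """Sum the flow values per source (start) and per target (end)."""
    start, end = defaultdict(int), defaultdict(int)
    for source, target, value in links:
        start[source] += value
        end[target] += value
    return dict(start), dict(end)


start, end = distributions(links)
print("start:", start)
print("end:  ", end)
```

Both distributions add up to the same total because every flow leaves exactly one source and arrives at exactly one target.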

Python libraries for Sankey plots

Data

https://download.statistik-berlin-brandenburg.de/0c8e82331bc2327a/802f7f020114/SB_A01-03-00_2020j01_BE.xlsx

In [2]:
data = {
    '2020': {
        'start': 3669491,
        'births': 38693,
        'immigration': 142923,
        'deaths': -37642,
        'emmigration': -144881,
        'end': -3664088
    }
}

flows = list(data['2020'].values())
labels = list(data['2020'].keys())
flows, labels
Out[2]:
([3669491, 38693, 142923, -37642, -144881, -3664088],
 ['start', 'births', 'immigration', 'deaths', 'emmigration', 'end'])
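
The signs in the data follow the convention matplotlib's Sankey expects: positive values are inputs, negative values are outputs. A quick balance check, sketched below with the 2020 figures copied from the cell above, shows that inputs and outputs do not cancel exactly. The slides do not explain the gap, and the plots in the following cells are drawn regardless.

```python
# 2020 flows copied from the cell above (inputs > 0, outputs < 0).
flows = [3669491, 38693, 142923, -37642, -144881, -3664088]

inputs = sum(f for f in flows if f > 0)     # start + births + immigration
outputs = -sum(f for f in flows if f < 0)   # deaths + emigration + end
residual = inputs - outputs                 # 0 would mean a perfectly balanced year

print(inputs, outputs, residual)  # 3851107 3846611 4496
```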

matplotlib

In [3]:
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey
In [4]:
sankey = Sankey()  # init
sankey.add(flows=flows, labels=labels)  # add flow(s)
sankey.finish()  # create
plt.show()  # show
In [5]:
scale = 0.0000001
sankey = Sankey(scale=scale)  # init with scale!
sankey.add(flows=flows, labels=labels)
sankey.finish()
plt.show()
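
Instead of hard-coding the scale, it can be derived from the data. The matplotlib documentation suggests choosing scale so that scale times the sum of the inputs is roughly 1.0. The sketch below follows that rule; suggest_scale is a hypothetical helper, not part of matplotlib, and the result is stored under a separate name so it does not overwrite the scale used in the slides.

```python
def suggest_scale(flows):
    """Return 1 / (sum of inputs), so the scaled inputs add up to about 1.0."""
    total_in = sum(f for f in flows if f > 0)
    if total_in <= 0:
        raise ValueError("need at least one positive (input) flow")
    return 1 / total_in


# 2020 flows from the data cell (inputs > 0, outputs < 0).
flows_2020_example = [3669491, 38693, 142923, -37642, -144881, -3664088]
suggested_scale = suggest_scale(flows_2020_example)
print(suggested_scale)
```

The result has the same order of magnitude as the hand-picked 0.0000001 and could be passed as Sankey(scale=suggested_scale).
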
In [6]:
sankey = Sankey(scale=scale)

# one orientation per flow:
#    0: inputs from the left, outputs to the right
#    1: from and to the top
#   -1: from and to the bottom
orientations = [0, -1, 1, -1, 1, 0]

# add flow(s) with orientations
sankey.add(flows=flows, labels=labels, orientations=orientations)
sankey.finish()
plt.show()
In [7]:
pathlengths = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

sankey = Sankey(scale=scale)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,
)  # add flow(s) with orientations and pathlengths
sankey.finish()
plt.show()
In [8]:
def format_number(n):
    return '{:,}'.format(n)  # add thousand separator

# add number format
sankey = Sankey(scale=scale, format=format_number)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,   
)
sankey.finish()
plt.show()
In [9]:
sankey = Sankey(scale=scale, format=format_number)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,
    facecolor='lightgray'  # change color
)
sankey.finish()
plt.title("Berlin Census 2020")  # add title
plt.show()
In [10]:
# add second year
data = {
    '2019': {
        'start 2019': 3644826,
        'births': 39503,
        'immigration': 184744,
        'deaths': -34739,
        'emmigration': -161513,
        'end 2019': -3669491
    },
    '2020': {
        'start 2020': 3669491,
        'births': 38693,
        'immigration': 142923,
        'deaths': -37642,
        'emmigration': -144881,
        'end 2020': -3664088
    }
}
In [11]:
flows_2019 = list(data['2019'].values())
labels_2019 = list(data['2019'].keys())
labels_2019[-1] = None  # remove last label
flows_2020 = list(data['2020'].values())
labels_2020 = list(data['2020'].keys())
In [12]:
pathlengths = [0.3, 0.3, 0.1, 0.1, 0.3, 0.3]

sankey = Sankey(scale=scale, format=format_number)
sankey.add(
    flows=flows_2019, labels=labels_2019,
    orientations=orientations,
    pathlengths=pathlengths,
    facecolor='lightgray'
)
sankey.add(
    flows=flows_2020, labels=labels_2020,
    orientations=orientations,
    pathlengths=pathlengths,
    prior=0, connect=(5, 0),  # connect second flow to first
    facecolor='darkgray'
)
sankey.finish()
plt.title("Berlin Census 2019 & 2020")  # add title
plt.show()

matplotlib notes

pySankey

In [13]:
import pandas as pd
from pySankey.sankey import sankey
In [14]:
# create DataFrame from 2020 data
df_2020 = pd.DataFrame([
    # start -> deaths
    {'source': 'start', 'target': 'deaths', 'value': 37642},
    # start -> emmigration
    {'source': 'start', 'target': 'emmigration', 'value': 144881},
    # start -> end
    {'source': 'start', 'target': 'end', 'value': 3669491},
    # births -> end
    {'source': 'births', 'target': 'end', 'value': 38693},
    # immigration -> end
    {'source': 'immigration', 'target': 'end', 'value': 142923},
])
df_2020
Out[14]:
source target value
0 start deaths 37642
1 start emmigration 144881
2 start end 3669491
3 births end 38693
4 immigration end 142923
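
The rows above were typed by hand. A minimal sketch of deriving them from the signed flows used in the matplotlib section could look like this. flows_to_links is a hypothetical helper, not part of pySankey or pandas, and it mirrors the table above, including the choice to give the start -> end link the full start population.

```python
def flows_to_links(year_data, start="start", end="end"):
    """Turn signed flows (inputs > 0, outputs < 0) into source/target/value rows."""
    links = []
    for label, value in year_data.items():
        if label == end:
            continue  # the end node only appears as a target
        if label == start:
            links.append({"source": start, "target": end, "value": value})
        elif value > 0:  # input: flows into the end node
            links.append({"source": label, "target": end, "value": value})
        else:  # output: flows out of the start node
            links.append({"source": start, "target": label, "value": -value})
    return links


# 2020 flows as in the first data cell (spelling of keys kept as in the slides).
year_2020 = {
    "start": 3669491, "births": 38693, "immigration": 142923,
    "deaths": -37642, "emmigration": -144881, "end": -3664088,
}
links_2020 = flows_to_links(year_2020)
for row in links_2020:
    print(row)
```

Passing links_2020 to pd.DataFrame should give the same five rows as df_2020, only in a different order.
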
In [15]:
sankey(
    left=df_2020['source'], right=df_2020['target'],
    leftWeight=df_2020['value'],
    fontsize=14,
    #figure_name="Berlin Census 2020",  # used for saving png, not title
)

pySankey notes

  • https://github.com/anazalea/pySankey
  • only code documentation
  • examples do not work after pip install
  • figure_name not in docstring: used for saving file (not title)
  • requires trial and error
  • only one flow level possible
  • looks more like a modern Sankey plot than matplotlib
  • great for simple plots

psankey

In [16]:
from psankey.sankey import sankey
In [17]:
nodes, fig, ax = sankey(
    df_2020, aspect_ratio=4/3,
    nodelabels=True, linklabels=True, labelsize=5,
)
plt.title("Berlin Census 2020")  # add title
plt.show()

psankey notes

  • https://github.com/mandalsubhajit/psankey
  • works with pd.DataFrames
  • works for multiple flow levels
  • short but helpful documentation in README.md
  • some smart options
    • nodemodifier to highlight nodes
  • node positions not customizable

holoviews

In [18]:
import holoviews as hv
from holoviews import opts, dim
hv.extension('bokeh')
width, height = 600, 400
In [19]:
# run example code
sankey = hv.Sankey([
    ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Z', 6],
    ['B', 'X', 2], ['B', 'Y', 9], ['B', 'Z', 4]
])
sankey.opts(width=width, height=height)
Out[19]:
In [20]:
# pass DataFrame from previous example
sankey = hv.Sankey(df_2020)
sankey.opts(width=width, height=height)
Out[20]:
In [21]:
# create DataFrame from 2019 & 2020 data
df = pd.DataFrame([
    # 2019
    {'source': '2019', 'target': '2020', 'value': 3644826, 'color': 'lightgray'},
    {'source': '2019', 'target': 'deaths `19', 'value': 34739, 'color': '#a6cee3'},
    {'source': '2019', 'target': 'emmigration `19', 'value': 161513, 'color': '#1f78b4'},
    {'source': 'births `19', 'target': '2020', 'value': 39503, 'color': '#b2df8a'},    
    {'source': 'immigration `19', 'target': '2020', 'value': 184744, 'color': '#33a02c'},
    # 2020
    {'source': '2020', 'target': '2021', 'value': 3669491, 'color': 'lightgray'},
    {'source': '2020', 'target': 'deaths `20', 'value': 37642, 'color': '#a6cee3'},
    {'source': '2020', 'target': 'emmigration `20', 'value': 144881, 'color': '#1f78b4'},
    {'source': 'births `20', 'target': '2021', 'value': 38693, 'color': '#b2df8a'},
    {'source': 'immigration `20', 'target': '2021', 'value': 142923, 'color': '#33a02c'},
])
df.head(3)
Out[21]:
source target value color
0 2019 2020 3644826 lightgray
1 2019 deaths `19 34739 #a6cee3
2 2019 emmigration `19 161513 #1f78b4
In [22]:
sankey = hv.Sankey(df)
sankey.opts(width=width, height=height, cmap='Set2',
            edge_color=dim('source').str(),
            node_color=dim('target').str())
Out[22]:

holoviews notes

plotly

In [23]:
import plotly.graph_objects as go
In [24]:
# example from https://plotly.com/python/sankey-diagram/
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color="black", width=0.5),
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = "blue"
    ),
    link = dict(
      # indices correspond to labels, eg A1, A2, A1, B1, ...
      source = [0, 1, 0, 2, 3, 3],
      target = [2, 3, 3, 4, 4, 5],
      value = [8, 4, 2, 8, 4, 2]
  ))])
In [25]:
fig.update_layout(
    title_text="Basic Sankey Diagram",
    width=width, height=height, font_size=10)
fig.show()
In [26]:
# create nodes with index from DataFrame
# https://stackoverflow.com/a/69464558
import numpy as np
nodes = np.unique(df[["source", "target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))
nodes
Out[26]:
2019                0
2020                1
2021                2
births `19          3
births `20          4
deaths `19          5
deaths `20          6
emmigration `19     7
emmigration `20     8
immigration `19     9
immigration `20    10
dtype: int64
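
plotly identifies nodes by integer index rather than by name, which is why the lookup table above is needed. The same idea in plain Python, as a sketch on a shortened, illustrative subset of the links: collect the unique names, sort them, number them, then translate each link.

```python
# Shortened subset of the (source, target) pairs, for illustration only.
pairs = [
    ("2019", "2020"),
    ("2019", "deaths `19"),
    ("births `19", "2020"),
]

# Unique node names in sorted order, then name -> index.
names = sorted({name for pair in pairs for name in pair})
index = {name: i for i, name in enumerate(names)}

# Integer lists in the shape plotly's link.source / link.target expect.
source = [index[s] for s, _ in pairs]
target = [index[t] for _, t in pairs]

print(names)
print(source, target)
```

The order of names matters: node labels, positions and colors passed to plotly later must follow the same order.
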
In [27]:
fig = go.Figure(
    data=[
        go.Sankey(
        node={
            "label": nodes.index,
        },
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
        })
    ]
)
In [28]:
fig.update_layout(
    title_text="Berlin Census 2019 & 2020",
    width=width, height=height, font_size=10)
fig.show()
In [29]:
x = [.1, .4, .7,  # years
     .1, .4,  # births
     .3, .6,  # deaths
     .3, .6,  # emmigration
     .1, .4,  # immigration
]
y = [.5, .5, .5,  # years
     .75, .8,  # births
     .2, .25,  # deaths
     .25, .3,  # emmigration
     .7, .75,  # immigration
]
color = ["darkgray", "darkgray", "darkgray",
         "#b2df8a", "#b2df8a",  # light green
         "#a6cee3", "#a6cee3",  # light blue
         "#1f78b4", "#1f78b4",  # dark blue
         "#33a02c", "#33a02c",  # dark green
]
x, y
Out[29]:
([0.1, 0.4, 0.7, 0.1, 0.4, 0.3, 0.6, 0.3, 0.6, 0.1, 0.4],
 [0.5, 0.5, 0.5, 0.75, 0.8, 0.2, 0.25, 0.25, 0.3, 0.7, 0.75])
In [30]:
fig = go.Figure(
    data=[
        go.Sankey(
        arrangement = "freeform",
        node={
            "label": nodes.index,
            "x": x,
            "y": y,
            "pad": 100,  # padding between nodes
            "color": color,
        },
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
            "color": df["color"],
        })
    ]
)
In [31]:
fig.update_layout(
    title_text="Berlin Census 2019 & 2020", font_size=10
)
fig.show()

plotly notes

Takeaways

  • Check whether your data suits a Sankey plot
    • flows, not proportions
    • not too many categories
  • Make sure it is worth the effort
    • Get data into the correct format
    • Customizing the plot
  • Think about colors!
  • What do you want to show?
  • Search for good examples to guide you
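
One way to act on "not too many categories" is to merge small flows into a single bucket before plotting. The sketch below is only an illustration: lump_small_flows is a hypothetical helper, the 5 % threshold is arbitrary, and the data is made up.

```python
def lump_small_flows(links, min_share=0.05, other="other"):
    """Redirect flows below min_share of the total to a shared 'other' target."""
    total = sum(value for _, _, value in links)
    merged = {}
    for source, target, value in links:
        if value / total < min_share:
            target = other  # too small to show on its own
        key = (source, target)
        merged[key] = merged.get(key, 0) + value
    return [(s, t, v) for (s, t), v in merged.items()]


# Made-up flows: (source, target, value).
raw = [("A", "X", 50), ("A", "Y", 30), ("A", "Z", 2), ("A", "W", 3), ("B", "X", 15)]
lumped = lump_small_flows(raw)
print(lumped)
```

The total volume is unchanged; only the number of visible categories shrinks.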

Do

https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png

Do

https://www.economist.com/graphic-detail/2019/11/01/a-british-election-and-other-uncertainties

Do

https://www.ipoint-systems.com/blog/from-data-to-knowledge-the-power-of-elegant-sankey-diagrams/

Don't

https://www.sankey-diagrams.com/intra-eu-horse-meat-trade/

Don't

https://www.sankey-diagrams.com/how-not-to-sankey/